17
Genomics
viruses, 20 which generally have very compact genomes. However, the reading frames
of eukaryotes are generally nonoverlapping (i.e., only the triplets AAG, TTC, …
would be available).
Due to the absence of unambiguous separators, the available structural information
in DNA is much more basic than in a human language. Even if the “meaning” of a
DNA sequence (a gene) that corresponds to a functional protein might be more
or less clear, especially in the case of prokaryotes, it must be remembered that
the sequence may be shot through with introns; even the stop codons (Table 7.1)
are not unambiguous. Only a small fraction (a few per cent) of eukaryotic genome
sequences actually corresponds to proteins (cf. Table 14.2), and any serious attempt to
understand the semantics of the genome must encompass the totality of its sequence.
Nucleotide Frequencies
Due to the lack of separators, it is necessary to work with $n$-grams rather than words
as such. Basic information about the sequence is encapsulated in the frequency
dictionaries $W_n$ of the $n$-grams (i.e., lists of the numbers of occurrences of each
possible $n$-gram). Each sequence can then be plotted as a point in $M^n$-dimensional
space, where $M$ is the number of letters in the alphabet ($M = 4$ for DNA, or 5 if we
include methylated cytosine as a distinct base).
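As a minimal sketch of such a frequency dictionary, the following Python function counts every possible n-gram of a sequence. The function name `frequency_dictionary`, the use of overlapping windows, and the toy sequence are assumptions made for illustration; they are not taken from the text.

```python
from collections import Counter
from itertools import product

def frequency_dictionary(seq, n, alphabet="ACGT"):
    """Return the number of occurrences of every possible n-gram in seq.

    Assumption: n-grams are read from overlapping windows (step of one base).
    n-grams containing letters outside the alphabet (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # List all M**n possible n-grams, including those that never occur,
    # so that every sequence maps to a point in the same space.
    return {"".join(p): counts.get("".join(p), 0)
            for p in product(alphabet, repeat=n)}

# Toy example: the 16 dinucleotide counts of a short sequence.
w2 = frequency_dictionary("ACGTACGGA", 2)
```

Reading the values of the dictionary in a fixed order gives the coordinates of the sequence in the 4^n-dimensional space; dividing by the total number of windows converts counts into relative frequencies.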
Even such very basic information can be used to distinguish between different
genomes; for example, thermophilic organisms are generally richer in C and G,
because the C–G base-pairing is stronger and, hence, stabler at higher temperatures
than A–T. Furthermore, since each genome corresponds to a point in a particular
space, distances between them can be determined, and phylogenetic trees can be
assembled.
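To illustrate how genomes become comparable points, the sketch below computes single-base frequencies, the resulting C+G fraction, and a distance between two sequences. The Euclidean metric and the two toy sequences are assumptions chosen for simplicity; the text does not specify which distance is used.

```python
import math

def base_frequencies(seq, alphabet="ACGT"):
    """Relative frequency of each base, in the order given by alphabet."""
    seq = seq.upper()
    counts = [seq.count(b) for b in alphabet]
    total = sum(counts)
    return [c / total for c in counts]

def gc_content(seq):
    """Fraction of bases that are C or G."""
    a, c, g, t = base_frequencies(seq)
    return c + g

def euclidean_distance(p, q):
    """Euclidean distance between two frequency vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Two hypothetical toy "genomes", one AT-rich and one GC-rich.
at_rich = "ATATATGC"
gc_rich = "GCGCGCAT"
d = euclidean_distance(base_frequencies(at_rich), base_frequencies(gc_rich))
```

A matrix of such pairwise distances over many genomes is the usual input to distance-based tree-building methods.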
The four-dimensional space corresponding to the single base-pair frequencies is
not perhaps very interesting. Already the 16-dimensional space corresponding to the
dinucleotide frequencies is richer and might be expected to be more revealing. In
particular, given the single base-pair frequencies, one can compute the dinucleotide
frequencies expected from random assembly of the genome and determine diver-
gences from randomness. Dinucleotide bias is assessed, for example, by the odds
ratio $\rho_{XY} = w_{XY}/(w_X w_Y)$, where $w_X$ is the frequency of nucleotide X. 21 We will
return to this comparison of actual with expected frequencies below.
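The comparison of observed with expected dinucleotide frequencies can be sketched as follows. Under random assembly the expected frequency of the dinucleotide XY is the product of the single-base frequencies of X and Y, and the odds ratio is the observed frequency divided by that product. The function name is hypothetical, and the sketch considers a single strand only; whether to pool the sequence with its reverse complement is left open here.

```python
from collections import Counter

def dinucleotide_odds_ratios(seq):
    """Odds ratio (observed / expected frequency) for each of the 16 dinucleotides."""
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)                                # single-base counts
    di = Counter(seq[i:i + 2] for i in range(n - 1))   # overlapping dinucleotides
    rho = {}
    for x in "ACGT":
        for y in "ACGT":
            w_xy = di[x + y] / (n - 1)                 # observed frequency
            expected = (mono[x] / n) * (mono[y] / n)   # random-assembly expectation
            # Undefined if either base is absent from the sequence.
            rho[x + y] = w_xy / expected if expected > 0 else float("nan")
    return rho

rho = dinucleotide_odds_ratios("ACGTACGT")
```

Values well above 1 indicate over-represented dinucleotides and values well below 1 under-represented ones; for a sequence as short as this toy example the ratios are of course dominated by sampling noise.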
Instead of representing the entire genome by a single point, one can divide it up
into roughly gene-long fragments (100–1000 base pairs), determine their frequency
dictionaries, and apply some kind of clustering algorithm to the collection of points
thereby generated. Alternatively, dimensional reduction using principal component
analysis (Sect. 13.2.2) may be adequate. The distributions of single base-pair and
dinucleotide frequencies look like Gaussian clouds, but the triplet frequencies reveal
a remarkable seven-cluster structure. 22 It is natural to interpret the seven clusters as
the six possible reading frames (three in each direction) plus the “noncoding” DNA.
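The fragment-based procedure can be sketched as below: the sequence is cut into windows, and each window is mapped to a 64-dimensional vector of triplet frequencies. Several details are assumptions made for illustration rather than the method of the cited work: the window width and step, the counting of non-overlapping triplets starting at the first base of each window, and all function names.

```python
from itertools import product

TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """Relative frequencies of the 64 triplets, read as non-overlapping
    triplets starting at the first base of the fragment."""
    counts = [0] * 64
    for i in range(0, len(fragment) - 2, 3):
        t = fragment[i:i + 3]
        if t in INDEX:                  # skip triplets with ambiguous bases
            counts[INDEX[t]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def fragment_vectors(seq, width=300, step=300):
    """One triplet-frequency vector per window of the given width."""
    seq = seq.upper()
    return [triplet_vector(seq[s:s + width])
            for s in range(0, len(seq) - width + 1, step)]

# Toy periodic sequence standing in for a genome.
points = fragment_vectors("ATGAAATAG" * 100, width=300, step=300)
```

Each element of `points` is one point in 64-dimensional space; the collection can then be passed to a clustering routine or to principal component analysis. Because triplets are read from the start of each window, a window's vector depends on its phase relative to any coding frame it overlaps, which is what would allow the reading frames to separate into distinct clusters.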
20 For example, Zaaijer et al. (2007).
21 See, e.g., Karlin et al. (1994).
22 Gorban et al. (2005).